traditional ml model
Training-Free Active Learning Framework in Materials Science with Large Language Models
Wang, Hongchen, Castañeda, Rafael Espinosa, Werber, Jay R., Fehlis, Yao, Kim, Edward, Hattrick-Simpers, Jason
Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitations and domain-specific feature engineering, restricting their generalizability. Large language models (LLMs) offer a new paradigm by leveraging their pretrained knowledge and universal token-based representations to propose experiments directly from text-based descriptions. Here, we introduce an LLM-based active learning framework (LLM-AL) that operates in an iterative few-shot setting and benchmark it against conventional ML models across four diverse materials science datasets. We explored two prompting strategies: one using concise numerical inputs suited for datasets with more compositional and structured features, and another using expanded descriptive text suited for datasets with more experimental and procedural features to provide additional context. Across all datasets, LLM-AL could reduce the number of experiments needed to reach top-performing candidates by over 70% and consistently outperformed traditional ML models. We found that LLM-AL performs broader and more exploratory searches while still reaching the optima with fewer iterations. We further examined the stability boundaries of LLM-AL given the inherent non-determinism of LLMs and found its performance to be broadly consistent across runs, within the variability range typically observed for traditional ML approaches. These results demonstrate that LLM-AL can serve as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection and potential LLM-driven autonomous discovery.
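The iterative experiment-selection loop described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_active_learning`, `greedy_neighbor`, the integer candidate pool, and the quadratic `oracle` are all hypothetical, and `greedy_neighbor` merely stands in for whatever proposer (an LLM prompt or a surrogate ML model) selects the next experiment.

```python
def run_active_learning(pool, oracle, propose, n_init=2, max_iters=100):
    """Count how many proposals are needed to reach the best candidate in the pool.

    pool    -- list of candidate experiments (hypothetical: plain integers)
    oracle  -- function returning the measured outcome for a candidate
    propose -- function (observed, remaining) -> next candidate to measure
    """
    # Seed the loop with a few already-measured candidates (few-shot context).
    observed = {x: oracle(x) for x in pool[:n_init]}
    # The true optimum is known here only because this is a benchmark setting.
    best = max(oracle(x) for x in pool)
    iters = 0
    while max(observed.values()) < best and iters < max_iters:
        remaining = [x for x in pool if x not in observed]
        if not remaining:
            break
        x = propose(observed, remaining)
        observed[x] = oracle(x)  # "run the experiment"
        iters += 1
    return iters, observed


def greedy_neighbor(observed, remaining):
    """Hypothetical proposer: pick the unmeasured candidate closest to the current best."""
    best_x = max(observed, key=observed.get)
    return min(remaining, key=lambda x: abs(x - best_x))


pool = list(range(20))
oracle = lambda x: -(x - 12) ** 2  # toy objective with its optimum at x = 12
iters, observed = run_active_learning(pool, oracle, greedy_neighbor)
```

Swapping `greedy_neighbor` for another proposer with the same signature allows different selection strategies to be compared by iteration count on the same pool.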
Leveraging NTPs for Efficient Hallucination Detection in VLMs
Azachi, Ofir, Eliyahu, Kfir, Ani, Eyal El, Himelstein, Rom, Reichart, Roi, Pinter, Yuval, Calderon, Nitay
Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
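The idea of turning next-token probabilities into features for a lightweight detector can be sketched as below. This is an illustration only: the specific features (`min_prob`, `mean_prob`, `mean_log_prob`), the function names, and the fixed threshold are assumptions, and the threshold rule is a toy stand-in for the trained ML models the abstract describes.

```python
import math


def ntp_features(token_probs):
    """Aggregate per-token next-token probabilities into statement-level features.

    token_probs -- probabilities in (0, 1] that the model assigned to each
                   generated token (assumed to be available from the VLM).
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    log_probs = [math.log(p) for p in token_probs]
    return {
        "min_prob": min(token_probs),
        "mean_prob": sum(token_probs) / len(token_probs),
        "mean_log_prob": sum(log_probs) / len(log_probs),
    }


def flag_hallucination(token_probs, min_prob_threshold=0.2):
    """Toy rule (assumed, not from the paper): flag if any token was very uncertain."""
    return ntp_features(token_probs)["min_prob"] < min_prob_threshold


confident = [0.9, 0.8, 0.95]  # every token predicted with high probability
uncertain = [0.9, 0.05, 0.7]  # one token with a low NTP value
flags = (flag_hallucination(confident), flag_hallucination(uncertain))
```

In practice the feature dictionary would be fed to a trained classifier rather than compared against a hand-set threshold.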
REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks
Wang, Linna, You, Zhixuan, Zhang, Qihui, Wen, Jiunan, Shi, Ji, Chen, Yimin, Wang, Yusen, Ding, Fanqi, Feng, Ziliang, Lu, Li
Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs' emerging causal reasoning abilities, there is a lack of comprehensive benchmarks to assess their causal learning and their performance when informed by causal features in clinical risk prediction. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily due to the strict assumptions of many CD methods, which are often violated in complex clinical data. While the direct integration yields limited improvement, our benchmark reveals a more promising synergy.
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Chen, Canyu, Yu, Jian, Chen, Shan, Liu, Che, Wan, Zhongwei, Bitterman, Danielle, Wang, Fei, Shu, Kai
Large Language Models (LLMs) hold great promise to revolutionize current clinical systems, given their superior capabilities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost are still predominantly used in clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? To answer it, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench covers three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales and diverse prompting or fine-tuning strategies, cannot yet beat traditional ML models in clinical prediction, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.
Generative AI-in-the-loop: Integrating LLMs and GPTs into the Next Generation Networks
Zhang, Han, Sediq, Akram Bin, Afana, Ali, Erol-Kantarci, Melike
In recent years, machine learning (ML) techniques have created numerous opportunities for intelligent mobile networks and have accelerated the automation of network operations. However, complex network tasks may involve variables and considerations even beyond the capacity of traditional ML algorithms. On the other hand, large language models (LLMs) have recently emerged, demonstrating near-human-level performance in cognitive tasks across various fields. However, they remain prone to hallucinations and often lack common sense in basic tasks. Therefore, they are regarded as assistive tools for humans. In this work, we propose the concept of "generative AI-in-the-loop" and utilize the semantic understanding, context awareness, and reasoning abilities of LLMs to assist humans in handling complex or unforeseen situations in mobile communication networks. We believe that combining LLMs and ML models allows both to leverage their respective capabilities and achieve better results than either model alone. To support this idea, we begin by analyzing the capabilities of LLMs and comparing them with traditional ML algorithms. We then explore potential LLM-based applications in line with the requirements of next-generation networks. We further examine the integration of ML and LLMs, discussing how they can be used together in mobile networks. Unlike existing studies, our research emphasizes the fusion of LLMs with traditional ML-driven next-generation networks and serves as a comprehensive refinement of existing surveys. Finally, we provide a case study to enhance ML-based network intrusion detection with synthesized data generated by LLMs. Our case study further demonstrates the advantages of our proposed idea.
When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery
Xie, Yiqun, Wang, Zhihao, Chen, Weiye, Li, Zhili, Jia, Xiaowei, Li, Yanhua, Wang, Ruichen, Chai, Kangyang, Li, Ruohan, Skakun, Sergii
Foundation models, i.e., very large deep learning models, have demonstrated impressive performance in various language and vision tasks that is otherwise difficult to reach using smaller-size models. The major success of GPT-type language models is particularly exciting and raises expectations for the potential of foundation models in other domains, including satellite remote sensing. In this context, great efforts have been made to build foundation models and test their capabilities in broader applications; examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when are they or are they not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: the suitability of foundation models depends on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.
Machine Learning vs Deep Learning: The Generalization Problem
Bay, Yong Yi, Yearick, Kathleen A.
The capacity to generalize beyond the range of training data is a pivotal challenge, often synonymous with a model's utility and robustness. This study investigates the comparative abilities of traditional machine learning (ML) models and deep learning (DL) algorithms in terms of extrapolation -- a more challenging aspect of generalization because it requires the model to make inferences about data points that lie outside the domain it has been trained on. We present an empirical analysis where both ML and DL models are trained on an exponentially growing function and then tested on values outside the training domain. The choice of this function allows us to distinctly showcase the divergence in performance when models are required to predict beyond the scope of their training data. Our findings suggest that deep learning models possess inherent capabilities to generalize beyond the training scope, an essential feature for real-world applications where data is often incomplete or extends beyond the observed range. This paper argues for a nuanced understanding of the structural differences between ML and DL models, with an emphasis on the implications for both theoretical research and practical deployment.
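The experimental setup in the abstract above (train on an exponentially growing function, then test outside the training domain) can be sketched as follows. The abstract does not name the models compared, so the nearest-neighbour regressor here is an assumed example of a traditional model whose predictions are bounded by its training targets; `make_data` and `nearest_neighbor_predict` are hypothetical helper names, and the ranges are illustrative.

```python
import math


def make_data(lo, hi, n):
    """Sample n evenly spaced points of y = exp(x) on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    return xs, [math.exp(x) for x in xs]


def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbour regression: return the target of the closest training point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


# Train on x in [0, 5]; evaluate inside and outside that range.
train_x, train_y = make_data(0.0, 5.0, 51)

x_in, x_out = 2.0, 7.0
pred_in = nearest_neighbor_predict(train_x, train_y, x_in)
pred_out = nearest_neighbor_predict(train_x, train_y, x_out)

rel_error_in = abs(pred_in - math.exp(x_in)) / math.exp(x_in)
rel_error_out = abs(pred_out - math.exp(x_out)) / math.exp(x_out)
```

Because the out-of-range prediction can never exceed the largest training target, `rel_error_out` is large while `rel_error_in` stays small, which is the kind of divergence an extrapolation test is designed to expose.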
CARNA: Characterizing Advanced heart failure Risk and hemodyNAmic phenotypes using learned multi-valued decision diagrams
Lamp, Josephine, Wu, Yuxin, Lamp, Steven, Afriyie, Prince, Bilchick, Kenneth, Feng, Lu, Mazimba, Sula
Early identification of high-risk heart failure (HF) patients is key to timely allocation of life-saving therapies. Hemodynamic assessments can facilitate risk stratification and enhance understanding of HF trajectories. However, risk assessment for HF is a complex, multi-faceted decision-making process that can be challenging. Previous risk models for HF do not integrate invasive hemodynamics or support missing data, and use statistical methods prone to bias or machine learning methods that are not interpretable. To address these limitations, this paper presents CARNA, a hemodynamic risk stratification and phenotyping framework for advanced HF that takes advantage of the explainability and expressivity of machine-learned Multi-Valued Decision Diagrams (MVDDs). This interpretable framework learns risk scores that predict the probability of patient outcomes, and outputs descriptive patient phenotypes (sets of features and thresholds) that characterize each predicted risk score. CARNA incorporates invasive hemodynamics and can make predictions on missing data. The CARNA models were trained and validated using a total of five advanced HF patient cohorts collected from previous trials, and compared with six established HF risk scores and three traditional ML risk models. CARNA provides robust risk stratification, outperforming all previous benchmarks. Although focused on advanced HF, the CARNA framework is general purpose and can be used to learn risk stratifications for other diseases and medical applications.
Revolutionizing the Edge with TinyML
We live in a world where even our day-to-day activities like checking the weather, scrolling through social media, or taking photos depend on machine learning (ML) models. Traditionally, ML models operate on the cloud. The technology is often used to conduct data management and data processing on the relevant data gathered from Internet of Things (IoT) devices connected to the cloud network. However, this traditional ML model and IoT ecosystem have their own drawbacks and problems, which should be resolved soon to transform our world into a better-connected place. Let's take a look at the challenges posed by traditional cloud-based ML models and how 'tiny' machine learning (tinyML) can help resolve them.